by Amir Ebrahimi
The analysis that follows make use of the the DOT’s quarterly airfare data from the Domestic Airline Consumer Airfare Report. The reports and airfare data can be found in original form at:
http://www.transportation.gov/policy/aviation-policy/domestic-airline-consumer-airfare-report
My goals for this analysis are primarily to have an informed perspective of air travel from actual data. Secondarily, I’m hoping that some practical facts emerge that any traveler could benefit from. I will perform univariate, bivariate, and multivariate analysis in order to achieve these goals. Finally, I will attempt to build a linear model that can account for relationships among the data.
Some records were missing the information for the low and large fare carriers. Others had erroneous fare data. Incomplete records are not that useful to me in the analysis and since the NA data is less than 1% of the overall data I am deciding to exclude these records from the analysis. The summary below shows the individual NA counts per variable.
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.000 Min. : 101.0 DAL :174
## 1st Qu.:2005 1st Qu.:1.000 1st Qu.: 837.8 MDW :117
## Median :2009 Median :2.000 Median :1190.0 JFK : 62
## Mean :2008 Mean :2.414 Mean :1174.3 EWR : 55
## 3rd Qu.:2012 3rd Qu.:3.000 3rd Qu.:1611.0 HOU : 50
## Max. :2014 Max. :4.000 Max. :2620.0 LGB : 49
## NA's :5 (Other):934
## airport_2 passengers fare carrier_lg
## MDW :174 Min. : 0.1000 Min. : 63.95 99 : 4
## LGB : 75 1st Qu.: 0.1099 1st Qu.: 215.03 CO : 3
## JFK : 64 Median : 0.2174 Median : 294.33 G4 : 2
## PIE : 61 Mean : 0.7139 Mean : 341.60 AA : 1
## SWF : 61 3rd Qu.: 0.3297 3rd Qu.: 407.50 US : 1
## HOU : 60 Max. :137.0000 Max. :2133.00 (Other): 0
## (Other):946 NA's :1430
## large_ms fare_lg carrier_low lf_ms
## Min. :0.0454 Min. : 95.76 99 : 4 Min. :1
## 1st Qu.:0.0670 1st Qu.:121.76 G4 : 2 1st Qu.:1
## Median :0.0800 Median :196.00 AA : 1 Median :1
## Mean :0.4656 Mean :256.97 3M : 0 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:322.75 9K : 0 3rd Qu.:1
## Max. :1.0000 Max. :618.00 (Other): 0 Max. :1
## NA's :1434 NA's :1434 NA's :1434 NA's :1438
## fare_low
## Min. : 95.76
## 1st Qu.:102.64
## Median :109.52
## Mean :113.09
## 3rd Qu.:121.76
## Max. :134.00
## NA's :1438
## [1] "Omitting 1441 NA records (0.91% of all records)"
The variables in the dataset are:
## [1] "Year" "quarter" "nsmiles" "airport_1" "airport_2"
## [6] "passengers" "fare" "carrier_lg" "large_ms" "fare_lg"
## [11] "carrier_low" "lf_ms" "fare_low" "haul"
The structure of the dataset and some sample values are:
## Classes 'tbl_df' and 'data.frame': 156170 obs. of 14 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ quarter : num 1 1 1 1 1 1 1 1 1 1 ...
## $ nsmiles : int 1300 151 1308 1347 1351 160 1339 646 654 963 ...
## $ airport_1 : Factor w/ 313 levels "ABE","ABI","ABQ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ airport_2 : Factor w/ 299 levels "ATL","AUS","AVL",..: 48 51 54 64 106 113 115 160 195 278 ...
## $ passengers : num 0.889 9.444 41.667 0.222 2.222 ...
## $ fare : num 130 298 355 175 322 ...
## $ carrier_lg : Factor w/ 39 levels "3M","99","9K",..: 9 33 33 9 11 32 33 24 32 33 ...
## $ large_ms : num 0.75 0.965 0.421 1 0.7 ...
## $ fare_lg : num 149 304 337 175 358 ...
## $ carrier_low: Factor w/ 49 levels "3M","99","9K",..: 13 42 42 11 29 40 42 29 40 42 ...
## $ lf_ms : num 0.25 0.965 0.421 1 0.2 ...
## $ fare_low : num 74.5 304.1 337.4 174.5 129.3 ...
## $ haul : Factor w/ 2 levels "short","medium": 2 1 2 2 2 1 2 2 2 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1441] 33 95 194 266 270 427 430 432 441 579 ...
## .. ..- attr(*, "names")= chr [1:1441] "33" "95" "194" "266" ...
Airports and carriers are factored variables in the dataset. The full names for the airport and carriers can be looked up by code at IATA.
## $airport_1
## [1] "ABE" "ABI" "ABQ" "ABY" "ACK" "ACT" "ACV" "ACY" "AEX" "AGS" "ALB"
## [12] "ALO" "ALW" "AMA" "APF" "ART" "ASE" "ATL" "ATW" "AUS" "AUW" "AVL"
## [23] "AVP" "AZA" "AZO" "BDL" "BFL" "BGM" "BGR" "BHB" "BHM" "BIL" "BIS"
## [34] "BJI" "BKG" "BLI" "BLV" "BMI" "BNA" "BOI" "BOS" "BPT" "BQK" "BRO"
## [45] "BTM" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CEC" "CHA"
## [56] "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL" "CLT" "CMH" "CMI"
## [67] "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG" "CVG" "CWA" "CYS"
## [78] "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN" "DIK" "DLH"
## [89] "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "EKO" "ELM" "ELP"
## [100] "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA" "FLG"
## [111] "FLL" "FLO" "FNL" "FNT" "FOE" "FSD" "FSM" "FWA" "FYV" "GCC" "GEG"
## [122] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [133] "GTR" "GUC" "GYY" "HDN" "HFD" "HGR" "HHH" "HLN" "HOB" "HOU" "HPN"
## [144] "HRL" "HSV" "HTS" "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP"
## [155] "ILE" "ILG" "ILM" "IND" "INL" "IPL" "IPT" "ISN" "ISP" "ITH" "JAC"
## [166] "JAN" "JAX" "JFK" "JLN" "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE"
## [177] "LCH" "LCK" "LEB" "LEX" "LFT" "LGA" "LGB" "LIT" "LMT" "LNK" "LRD"
## [188] "LSE" "LWB" "LWS" "LYH" "MAF" "MBS" "MCI" "MCO" "MDT" "MDW" "MEI"
## [199] "MEM" "MFE" "MFR" "MGM" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB"
## [210] "MLI" "MLU" "MMH" "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP"
## [221] "MSY" "MTJ" "MVY" "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD"
## [232] "ORF" "ORH" "ORL" "OTH" "PAH" "PBG" "PBI" "PDX" "PFN" "PGV" "PHF"
## [243] "PHL" "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM"
## [254] "PSP" "PUW" "PVD" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI" "RIC"
## [265] "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN" "SAT" "SAV"
## [276] "SBA" "SBN" "SBP" "SBY" "SCE" "SDF" "SEA" "SFO" "SGF" "SGU" "SHV"
## [287] "SJC" "SJT" "SLC" "SMF" "SNA" "SPI" "SRQ" "STL" "STS" "SUN" "SWF"
## [298] "SYR" "TLH" "TOL" "TPA" "TRI" "TUL" "TUS" "TVC" "TXK" "TYR" "TYS"
## [309] "VPS" "XNA" "YKM" "YNG" "YUM"
##
## $airport_2
## [1] "ATL" "AUS" "AVL" "AVP" "AZA" "BDL" "BFL" "BHM" "BIL" "BKG" "BLI"
## [12] "BMI" "BNA" "BOI" "BOS" "BTR" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE"
## [23] "CAK" "CEC" "CHA" "CHI" "CHO" "CHS" "CIC" "CID" "CLD" "CLE" "CLL"
## [34] "CLT" "CMH" "CMI" "CMX" "COD" "COS" "COU" "CPR" "CRP" "CRW" "CSG"
## [45] "CVG" "CWA" "DAB" "DAL" "DAY" "DBQ" "DCA" "DEC" "DEN" "DFW" "DHN"
## [56] "DIK" "DLH" "DRO" "DSM" "DTW" "EAT" "EAU" "ECP" "EFD" "EGE" "ELM"
## [67] "ELP" "ERI" "EUG" "EVV" "EWN" "EWR" "EYW" "FAR" "FAT" "FAY" "FCA"
## [78] "FLG" "FLL" "FLO" "FNT" "FSD" "FSM" "FWA" "FYV" "GCC" "GCK" "GEG"
## [89] "GFK" "GJT" "GNV" "GPT" "GRB" "GRI" "GRK" "GRR" "GSO" "GSP" "GTF"
## [100] "GTR" "GUC" "HDN" "HFD" "HHH" "HLN" "HOU" "HPN" "HRL" "HSV" "HTS"
## [111] "HVN" "HYA" "IAD" "IAG" "IAH" "ICT" "IDA" "IFP" "ILE" "ILG" "ILM"
## [122] "IND" "IPT" "ISN" "ISP" "ITH" "IYK" "JAC" "JAN" "JAX" "JFK" "JLN"
## [133] "JRB" "LAN" "LAS" "LAW" "LAX" "LBB" "LBE" "LCH" "LEB" "LEX" "LFT"
## [144] "LGA" "LGB" "LIT" "LMT" "LNK" "LRD" "LSE" "LWB" "LWS" "LYH" "MAF"
## [155] "MBS" "MCI" "MCN" "MCO" "MDT" "MDW" "MEI" "MEM" "MFE" "MFR" "MGM"
## [166] "MGW" "MHK" "MHT" "MIA" "MKC" "MKE" "MKG" "MLB" "MLI" "MLU" "MMH"
## [177] "MOB" "MOD" "MOT" "MQT" "MRY" "MSN" "MSO" "MSP" "MSY" "MTJ" "MVY"
## [188] "MYR" "NYC" "OAJ" "OAK" "OKC" "OMA" "ONT" "ORD" "ORF" "ORH" "ORL"
## [199] "OTH" "OXR" "PAH" "PBG" "PBI" "PDX" "PFN" "PGD" "PGV" "PHF" "PHL"
## [210] "PHX" "PIA" "PIE" "PIH" "PIT" "PLN" "PNS" "PQI" "PSC" "PSM" "PSP"
## [221] "PUW" "PVC" "PVD" "PVU" "PWM" "RAP" "RDD" "RDM" "RDU" "RFD" "RHI"
## [232] "RIC" "RKD" "RKS" "RNO" "ROA" "ROC" "ROW" "RST" "RSW" "SAF" "SAN"
## [243] "SAT" "SAV" "SBA" "SBN" "SBP" "SBY" "SCE" "SCK" "SDF" "SEA" "SFB"
## [254] "SFO" "SGF" "SGU" "SHV" "SJC" "SJT" "SLC" "SMF" "SMX" "SNA" "SPI"
## [265] "SPS" "SRQ" "SSI" "STC" "STL" "STS" "SUN" "SUX" "SWF" "SYR" "TEX"
## [276] "TLH" "TOL" "TPA" "TRI" "TSS" "TTN" "TUL" "TUP" "TUS" "TVC" "TWF"
## [287] "TXK" "TYR" "TYS" "UIN" "VGT" "VLD" "VPS" "WAS" "WYS" "XNA" "YKM"
## [298] "YNG" "YUM"
##
## $carrier_lg
## [1] "3M" "99" "9K" "AA" "AS" "AX" "AZ" "B6" "CO" "DH" "DL" "E9" "F9" "FL"
## [15] "G4" "GQ" "HP" "JI" "KS" "LH" "N7" "NJ" "NK" "NW" "OS" "PN" "RP" "SY"
## [29] "TW" "TZ" "U5" "UA" "US" "VX" "WN" "XP" "YX" "YY" "ZV"
##
## $carrier_low
## [1] "3M" "99" "9K" "AA" "AQ" "AS" "AX" "AZ" "B6" "BA" "CO" "DH" "DL" "E9"
## [15] "EY" "F9" "FL" "G4" "GQ" "HP" "IB" "JI" "KS" "LH" "N7" "NH" "NJ" "NK"
## [29] "NW" "OH" "OO" "OS" "PN" "RP" "SN" "SY" "TW" "TZ" "U5" "UA" "UP" "US"
## [43] "VX" "WN" "WS" "YV" "YX" "YY" "ZV"
Here is a summary of all of the variables in the dataset:
## Year quarter nsmiles airport_1
## Min. :2001 Min. :1.000 Min. : 67 DCA : 7592
## 1st Qu.:2005 1st Qu.:1.000 1st Qu.: 618 DFW : 7456
## Median :2010 Median :2.000 Median : 957 EWR : 7076
## Mean :2009 Mean :2.456 Mean :1094 IAD : 5497
## 3rd Qu.:2012 3rd Qu.:4.000 3rd Qu.:1471 IAH : 5011
## Max. :2014 Max. :4.000 Max. :2783 JFK : 4975
## (Other):118563
## airport_2 passengers fare carrier_lg
## TPA : 8116 Min. : 0.10 Min. : 17.0 DL :36376
## ORD : 7745 1st Qu.: 7.50 1st Qu.: 183.9 AA :23791
## LGA : 5724 Median : 27.14 Median : 232.7 UA :22668
## MDW : 5549 Mean : 187.65 Mean : 241.7 US :20345
## JFK : 5081 3rd Qu.: 149.00 3rd Qu.: 288.6 WN :19188
## IAD : 4632 Max. :7021.09 Max. :2566.9 CO :13027
## (Other):119323 (Other):20775
## large_ms fare_lg carrier_low lf_ms
## Min. :0.1000 Min. : 17.0 DL :33147 Min. :0.0100
## 1st Qu.:0.5090 1st Qu.: 181.2 AA :28014 1st Qu.:0.2165
## Median :0.6786 Median : 231.1 UA :19319 Median :0.4898
## Mean :0.6892 Mean : 241.4 US :19008 Mean :0.5294
## 3rd Qu.:0.8977 3rd Qu.: 289.2 WN :15789 3rd Qu.:0.8777
## Max. :1.0000 Max. :2566.9 CO :11771 Max. :1.0000
## (Other):29122
## fare_low haul
## Min. : 12.0 short : 27243
## 1st Qu.: 167.0 medium:128927
## Median : 212.5
## Mean : 223.1
## 3rd Qu.: 265.5
## Max. :2566.9
##
Across all of the airport pair markets:
What is the shortest airport pair distance?
## Source: local data frame [1 x 3]
##
## airport_1 airport_2 nsmiles
## 1 MKE ORD 67
I’m curious which airport pairs had the most passengers traveling (on average) across all quarters:
## Source: local data frame [8,028 x 3]
##
## airport_1 airport_2 mean_passengers
## 1 LAX SFO 4815.315
## 2 JFK LAX 4424.042
## 3 LGA ORD 3992.264
## 4 FLL LGA 3636.068
## 5 ATL LGA 3550.596
## 6 JFK SFO 3109.886
## 7 LAX ORD 2808.042
## 8 LAS LAX 2802.146
## 9 FLL JFK 2779.251
## 10 EWR MCO 2760.996
## .. ... ... ...
Now, I’m wanting to see what the most traveled airport pairs were for any given quarter were:
## Source: local data frame [8,028 x 3]
##
## airport_1 airport_2 max_passengers
## 1 JFK LAX 7021.087
## 2 FLL LGA 6134.340
## 3 LAX SFO 5829.239
## 4 LGA ORD 5783.587
## 5 JFK SFO 5165.326
## 6 FLL JFK 4974.610
## 7 BOS LGA 4563.407
## 8 LAS LAX 4527.000
## 9 ATL LGA 4494.505
## 10 JFK MCO 4246.630
## .. ... ... ...
What year year and quarter was the most traveled airport pair from?
## Source: local data frame [1 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2014 4 2510 JFK LAX 7021.087 412.7212 DL
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
I’m wondering now how many unique airports there are in the dataset:
## Source: local data frame [338 x 1]
##
## ap
## 1 WYS
## 2 WAS
## 3 VLD
## 4 UIN
## 5 TWF
## 6 TUP
## 7 TTN
## 8 TSS
## 9 TEX
## 10 SUX
## .. ...
So, there are 338 unique airports that exist in various airport pairs.
What are the top ten airports based on the cumulative count of passengers that traveled through them for the time period covered in the dataset?
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 ORD 3648684 340.2036 216.4300
## 2 LGA 2990062 282.0813 233.5829
## 3 DFW 2729046 269.8286 226.6328
## 4 EWR 2375390 226.8326 264.0948
## 5 JFK 2167518 215.5662 233.8276
## 6 TPA 2063635 244.4486 188.5500
## 7 LAX 1977230 558.2243 263.9967
## 8 DCA 1921686 188.0871 248.5837
## 9 MDW 1751125 191.0457 194.4199
## 10 IAH 1692062 181.8444 239.3054
## .. ... ... ... ...
What are the top ten airports based on the average count of passengers that traveled through them across all quarters?
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 ATL 1172222.7 1080.3896 191.9926
## 2 MCO 1100696.8 1073.8505 171.7487
## 3 DEN 848515.9 784.9361 202.0826
## 4 LAS 1200635.5 729.4262 196.1917
## 5 PHX 754172.6 635.8959 218.0236
## 6 SEA 620273.2 576.9984 243.6494
## 7 LAX 1977230.3 558.2243 263.9967
## 8 DTW 581018.9 551.7749 209.3524
## 9 MSP 576246.9 537.5437 234.5415
## 10 SAN 505210.3 511.3465 247.4843
## .. ... ... ... ...
I wonder what the most expensive airports to travel from are:
## Source: local data frame [338 x 4]
##
## ap sum_passengers avg_passengers avg_fare
## 1 TSS 4.590513 0.1995875 616.5870
## 2 JRB 1.524128 0.2177326 516.5652
## 3 IYK 2960.440939 38.4472849 480.1068
## 4 DIK 80.115942 11.4451346 456.3364
## 5 ISN 394.069703 13.1356568 452.4509
## 6 SMX 227.339377 7.8392889 424.4812
## 7 SUN 3852.807200 18.0037720 375.9384
## 8 OXR 220.545458 6.3012988 372.9604
## 9 IDA 16479.021780 42.3625239 361.4784
## 10 ASE 31951.048355 50.3959753 358.6789
## .. ... ... ... ...
TSS is the East 34th Street Heliport in NY, which technically isn’t an airport, but is included in the DOT fare data. According to Wikipedia, 72% of the flights are airtaxi.
Moving on to other variables…
Looking at fare, which is the price paid for one-way travel between an airport pair:
## 99%
## 493
The plot looks to have a good normal curve, but with a long tail. The 99th quantile of the dataset is just under 500, so it makes sense to set a limit on the x-axis to better view the histogram of fares
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 183.9 232.7 241.7 288.6 2567.0
I’m curious what the outliers look like (anything above the 99th quantile):
## Source: local data frame [1,560 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2013 2 2583 HPN MRY 0.1098901 2566.860 UA
## 2 2008 1 278 DAL XNA 0.1098901 2156.020 AA
## 3 2011 4 2508 HPN MFR 0.1086957 1824.990 UA
## 4 2008 2 1746 EGE JFK 0.2197802 1813.935 NW
## 5 2003 3 1377 BGM EFD 0.1000000 1645.000 CO
## 6 2007 2 283 JFK PHF 0.1098901 1311.990 US
## 7 2012 1 1372 DAL SBP 0.2197802 1308.550 UA
## 8 2002 4 840 EFD GSP 0.1000000 1292.000 CO
## 9 2011 1 1848 BUR MGM 0.1111111 1255.980 UA
## 10 2014 1 2292 BUR EYW 0.1111111 1248.000 DL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
Looks like most large fares involve large distances and I’ll want to look at the relationship between fares and distance when I get to bivariate analysis. I also notice that the average passengers per day is under 1 for those city pairs. Another interesting thing to look at is the market share for those city pairs. The DAL<->XNA pair seemed odd because of the low mileage, so I looked on Kayak to see what current prices were and they were all above $500.
I am curious about what the average fare is for airport pairs that have less than 1 person / day on average to see if it is high:
## Source: local data frame [1 x 1]
##
## avgfare
## 1 296.4438
Surprisingly, it isn’t as high as I thought it would be, but does fall beyond the third quartile.
Next, I want to see a breakdown of fare by haul to see the fare distribution for short and medium flights:
## Source: local data frame [2 x 2]
##
## haul mean(fare)
## 1 short 195.2914
## 2 medium 251.4555
I can see from this graph that there are more medium flights than short flights in the dataset. It’s understable that the average fare for short flights would cost less than medium flights.
Looking at nsmiles, which is the one-way distance between airport pairs:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67 618 957 1094 1471 2783
Mileage between airport pairs looks somewhat multimodal with heteroskedasticity or possibly a skewed normal distribution.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
If I re-scale the x-axis (log10), then I can view the long tail better:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 7.50 27.14 187.60 149.00 7021.00
The mean is larger than the median in this right-skewed histogram.
Now, let’s see a breakdown of passengers for small and medium flights to see how the feature is affected by flight length:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Source: local data frame [2 x 2]
##
## haul mean(passengers)
## 1 short 212.8344
## 2 medium 182.3225
I find it interesting to see that short haul trips have more passengers on average than medium trips (e.g. coast-to-coast travel).
Next, looking at fare_lg, which is the largest fare carrier’s average fare for the quarter:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 181.2 231.1 241.4 289.2 2567.0
The largest fares seem to closely match the average fares, which makes me think that very few low fares are sold in quantity.
So, now let’s look at fare_low, which are the lowest fare carrier’s average fare:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 167.0 212.5 223.1 265.5 2567.0
The distribution looks compressed, but similar to the large fare distribution, albeit with a lower mean and median.
Now, I’m curious about the difference between the average low and large fares:
## Source: local data frame [1 x 1]
##
## fare_diff
## 1 18.35838
In terms of percentage of the average fare:
## Source: local data frame [1 x 1]
##
## mean(fare_diff_percentage)
## 1 0.07011225
So, it looks like fares have a variance of 7% of the average fare between the average large and low fares.
Now, let me take a look at large_ms, which is the large fare carrier’s market share:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.5090 0.6786 0.6892 0.8977 1.0000
Seems like there is a sizable count of carriers with the largest fare that have 100% market share. I’d like to compare that to the rest:
## Source: local data frame [2 x 2]
##
## large_ms == 1 n
## 1 FALSE 143739
## 2 TRUE 12431
## n
## 1 0.08648314
So, 8.5% of all of the fare data is from carriers that have 100% market share between two airport pairs.
What is the market share distribution for large fare carriers based on trip length?
The distribution for medium length trips closely resembles the overall distribution, which is indicative of a higher quantity of medium haul trips having taken place over the reporting period.
So, what percentage of trips are short haul trips overall?
## Source: local data frame [2 x 2]
##
## haul == "short" n
## 1 FALSE 128927
## 2 TRUE 27243
## n
## 1 0.2113056
I was expecting a much lower percentage than 21%, however, it does help round out the picture of the overall market. Knowing that 79% of airport pairs are medium haul distance (> 500 miles) is corroborated by the mean statistic of nsmiles.
Now, let me take a look at lf_ms, which is the low fare carrier’s market share:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2165 0.4898 0.5294 0.8777 1.0000
This last plot is interesting, since there is a spike at 10%, so I looked at the data to verify that it was good, which it is. The reason for this spike in the data is explained by the DOT:
Note that the “lowest fare carrier” is the carrier with the lowest average fare that has at least a 10 percent share of the traffic in the market, except for markets where only a single carrier has a 10 percent or greater share.
Let’s take a closer look at this plot and what the underlying data looks like:
## Source: local data frame [7,270 x 8]
##
## Year quarter nsmiles airport_1 airport_2 carrier_low lf_ms fare_low
## 1 2001 3 696 ABE MDW CO 0.0741 179
## 2 2001 3 655 ABE ORD DL 0.0144 198
## 3 2001 3 1270 ABQ ATL TW 0.0834 143
## 4 2001 3 744 ABQ IAH AA 0.0286 117
## 5 2001 3 677 ABQ LAX HP 0.0853 97
## 6 2001 3 719 ABQ MCI CO 0.0185 99
## 7 2001 3 1122 ABQ MDW AA 0.0152 146
## 8 2001 3 889 ABQ OAK HP 0.0725 119
## 9 2001 3 610 ABQ SAT AA 0.0714 138
## 10 2001 3 718 ALB MDW CO 0.0220 109
## .. ... ... ... ... ... ... ... ...
From the above data it looks like ABQ (Albuquerque International Sunport) is dominated by a single large carrier.
Since we looked at distribution for large fare carriers based on trip length, we might as well look at the same for low fare carriers:
Nothing really stands out from this plot as significant to me. The distributions look roughly the same for both short haul and medium haul flights.
Finally, I’m curious about what percentage of carriers are both the largest and lowest fare carrier:
## Source: local data frame [1 x 1]
##
## n()/nrow(af)
## 1 0.5585964
So, 55.9% of carriers are both the largest and lowest fare carrier, which could mean many things. One possibility might be that a majority of carriers remain competitive by offering low fares in addition to being the largest fare carrier.
There are 160,258 records in the dataset with 13 features (Year, quarter, nsmiles, airport_1, airport_2, passengers, fare, carrier_lg, large_ms, fare_lg, carrier_low, lf_ms, and fare_low). The variables airport_1, airport_2, carrier_lg, and carrier_low are unordered factor variables.
Other observations:
The main features of interest to me in the dataset are fare, nsmiles, and passengers. I’d like to figure out what variables influence the fare. My guess is that distance between airports is a factor and that the activity between airports may have some weight.
Market share may have an impact on the average fare. My guess would be that areas that have high market share have less competition and may have higher fares.
At the time of my univariate analysis I had not created any variables. However, after performing bivariate analysis I thought it would be good to create an additional categorical variable, haul, for segmenting short and medium trips. A medium trip is anything over 500 miles. I then went back and performed additional univariate analysis.
Data cleaning was the bulk of the work during my investigation. I saw anomalies in market share, large and low fares, and erroneous data that needed to be marked as NA. Some variable naming changed across the years of the dataset and some fields were combined, so I had to fix those as well. I added the fixup code to the data loading functions. These changes were necessary in order to have consistent data to analyze.
Let’s start by taking a look at the correlation matrix across all features:
## Year quarter nsmiles passengers fare
## Year 1.00000000 -0.046260306 0.12873916 0.015224096 0.32166882
## quarter -0.04626031 1.000000000 -0.00439779 0.001492458 -0.03963321
## nsmiles 0.12873916 -0.004397790 1.00000000 -0.014061834 0.45878433
## passengers 0.01522410 0.001492458 -0.01406183 1.000000000 -0.18175474
## fare 0.32166882 -0.039633212 0.45878433 -0.181754745 1.00000000
## large_ms -0.02131800 -0.003712655 -0.34169886 -0.069154577 -0.14049460
## fare_lg 0.29869310 -0.038029960 0.43297986 -0.154098981 0.96016235
## lf_ms 0.08230782 -0.016324140 -0.23484039 -0.091825809 -0.07297730
## fare_low 0.33270106 -0.023622568 0.37768150 -0.177412663 0.90495618
## large_ms fare_lg lf_ms fare_low
## Year -0.021317999 0.29869310 0.08230782 0.332701064
## quarter -0.003712655 -0.03802996 -0.01632414 -0.023622568
## nsmiles -0.341698862 0.43297986 -0.23484039 0.377681505
## passengers -0.069154577 -0.15409898 -0.09182581 -0.177412663
## fare -0.140494605 0.96016235 -0.07297730 0.904956183
## large_ms 1.000000000 -0.13610928 0.68891203 0.004944203
## fare_lg -0.136109280 1.00000000 -0.11320016 0.873014613
## lf_ms 0.688912027 -0.11320016 1.00000000 0.081645398
## fare_low 0.004944203 0.87301461 0.08164540 1.000000000
Based on the correlation matrix it seems that it’s worth looking into mileage, year, and fare further:
##
## Pearson's product-moment correlation
##
## data: af$nsmiles and af$fare
## t = 204.044, df = 156168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4548597 0.4626912
## sample estimates:
## cor
## 0.4587843
##
## Pearson's product-moment correlation
##
## data: af$Year and af$fare
## t = 134.2526, df = 156168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3172153 0.3261082
## sample estimates:
## cor
## 0.3216688
As predicted, there is a strong positive correlation between mileage and fare, but passenger count has a negligible relationship to price. Year has a moderate positive correlation with fare.
Let me perform a plot matrix of Year, quarter, fare, nsmiles, passengers, carrier_lg, large_ms to get a sense of what the data looks like visually:
Based on the correlation values between nsmiles and fare, let’s take a look at that plot:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The trend line (red), shows a positive correlation between mileage and fare. In other words, fare increases as mileage increases, which makes sense intuitively. The blue line shows the smoothing of the point data and the green line shows the low fare carrier’s average fare (smoothed as well).
Let’s also take a look at fare as categorized by haul (trip length):
## Source: local data frame [2 x 4]
##
## haul mean(fare) median(fare) IQR(fare)
## 1 short 195.2914 186.2011 99.94043
## 2 medium 251.4555 241.5590 102.29485
Interestingly enough, the interquartile range (IQR) is roughly the same between short and medium haul trips.
Let’s take a look at Year and fare:
From the look of the fare data plotted it looks like there was an increase in air travel for 2011-2013, which has a darker region. There’s also a average trend upward (dotted line) of fare increasing over time.
Now, let me take a look at nsmiles and large_ms:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
With the concentration of 100% market share carriers, it makes sense to separate out the 100% market share carriers and view a distribution:
So, 100% market share carriers seem to have bi-modal distribution primarily around 200 and 900 miles when looking at the density smoothing.
Continuing with a scatterplot of the rest of the large fare carriers not having 100% market share:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
It is interesting to see that carriers with the largest fare lose market share as the distance increases. My guess is that with the correlation between distance and fare that there is more competition across longer flights because there is more revenue to be gained.
Now, let’s do the same with low fare carriers by looking at nsmiles and lf_ms:
The low fare 100% market share carriers also seem to have bi-modal distribution around 200 and 900 miles.
Continuing with a scatterplot of the rest of the low fare carriers not having 100% market share:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
There is also a similar trend here where carriers lose market share as the flight distance increases, however, a little more significantly than large fare carriers.
Continuing with passengers, what’s the trend across all of the years in the dataset?
Looks like 2012 was the best year for average passengers traveling, a drop in 2013, and a return in 2014.
I wonder what we can see if we plot the fare variances across all the large fare carriers:
It’s a busy plot, however I can see that United Airlines (UA) has the largest median fare. Let me double check this is the case by computing median statistics of the large fare for all large fare carriers (sorted high-to-low):
## Source: local data frame [38 x 2]
##
## carrier_lg median_fare_lg
## 1 UA 274.6100
## 2 VX 272.1100
## 3 9K 253.2400
## 4 DL 250.8700
## 5 AA 237.7900
## 6 US 233.2503
## 7 RP 232.3300
## 8 HP 231.2700
## 9 NW 218.7000
## 10 CO 217.8100
## .. ... ...
Might as well view the plot of the top ten large carriers (based on median fare):
Let me do the same for low fare carriers (skipping the large plot):
Cape Air (9K) has the largest fare among the low fare carriers. I’ll double check by similarly computing and sorting the median statistic for low fares across all low fare carriers:
## Source: local data frame [48 x 2]
##
## carrier_low median_fare_low
## 1 9K 249.300
## 2 UA 243.630
## 3 WS 239.930
## 4 RP 228.030
## 5 AA 226.900
## 6 DL 225.290
## 7 US 224.975
## 8 VX 212.535
## 9 AS 206.640
## 10 HP 203.910
## .. ... ...
After taking a look at the airlines, I’m curious what I might find if I plot the fares across the most traveled airports:
The dotted line is the overall median of fare across all airport pairs. It’s interesting to see the average fares among the most traveled airports and see which ones are higher or lower than the median fare among the group.
So, which of the airports in the plot above have a median fare above the overall median fare?
## Source: local data frame [6 x 3]
##
## larger_ap median_fare diff
## 1 LAX 276.0004 43.257679
## 2 EWR 260.8614 28.118691
## 3 VPS 242.9361 10.193397
## 4 DCA 241.3903 8.647587
## 5 XNA 236.8914 4.148652
## 6 IAH 236.5627 3.819965
Well, it looks like if you’re in any of the cities above that you’re paying more than the median fare across all airports, but it also depends on where you’re going.Also, it’s worth mentioning that these airports serve plenty of markets, which could be a factor.
Finally, let’s take a first step towards building a linear model based on fare and nsmiles:
##
## Call:
## lm(formula = fare ~ nsmiles, data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -252.91 -50.31 -9.09 38.81 2232.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.735e+02 3.850e-01 450.7 <2e-16 ***
## nsmiles 6.230e-02 3.053e-04 204.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75.75 on 156168 degrees of freedom
## Multiple R-squared: 0.2105, Adjusted R-squared: 0.2105
## F-statistic: 4.163e+04 on 1 and 156168 DF, p-value: < 2.2e-16
Looks like we can account for 20.96% of the fare variance with this model.
Fare correlates strongly with the distance that is being traveled. There has also been a slow trend towards increasing prices since 2005, but it might be good to compare this to inflation to see if actual costs have increased. Passengers had a negligible relationship to fare.
Based on the \(R^2\) value, distance accounts for 20.96% of the variance in fare. Other features can be incorporated into the model to better explain the variance in price, which I will build on in multivariate analysis.
Yes, market share seems to drop when distance traveled increases. I think that is because there is more competition among longer distance flights, since there is more revenue to be gained.
The strongest relationship I found was between fare and distance.
To ease into our multivariate analysis, let’s first take a look at the ratio of fare to nsmiles (bivariate), which could be considered fare value or cost per mile for fares (split by trip length):
## Source: local data frame [1 x 1]
##
## short_haul_value
## 1 0.6760771
## Source: local data frame [1 x 1]
##
## med_haul_value
## 1 0.2299902
If you’re traveling under 500 miles, then you can expect to pay $0.68 per mile on average. If you’re traveling over 500 miles, then you can can expect to pay roughly $0.23 per mile on average.
Now, I want to see a better breakdown of fare value per distance traveled:
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
##
## Formula: fare/nsmiles ~ A/sqrt(nsmiles)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## A 9.67556 0.01264 765.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1874 on 156169 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 1.987e-10
As you can see there is a lot of variance in cost per mile for travel under 500 miles. However, as a consumer if you travel larger distances you will eventually enjoy much better costs per mile.
In the plot above, the individual data points (colored in red or blueish-green) are overlayed with a smoothed conditional mean of the data in blue. The green and yellow lines are piece-wise linear fits for short and medium haul trips. Finally, the black line is a rough approximation of a non-linear fit that is essentially an inverse square root function.
With the above plot of fare value value being non-linear, it makes me want to revisit fare vs miles traveled (bivariate analysis) with different scales:
I now have a much better linear fit here. Let’s revisit the linear model and see if it improves:
##
## Call:
## lm(formula = I(log10(fare)) ~ I(sqrt(nsmiles)), data = af)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14881 -0.08252 0.00257 0.08453 1.08988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.117e+00 1.165e-03 1816.7 <2e-16 ***
## I(sqrt(nsmiles)) 7.583e-03 3.524e-05 215.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1333 on 156168 degrees of freedom
## Multiple R-squared: 0.2287, Adjusted R-squared: 0.2287
## F-statistic: 4.63e+04 on 1 and 156168 DF, p-value: < 2.2e-16
\(R^2\) went up by 1.83% with this change, so we can now better account for the relationship between fares and distance between airports.
Continuing with the concept of fare value, let’s take a look at the fare value across the largest airports (by activity):
## Source: local data frame [10 x 2]
##
## larger_ap median_value
## 1 LAX 5.741861
## 2 TPA 5.699229
## 3 JFK 4.687992
## 4 LGA 4.541571
## 5 IAH 4.095667
## 6 MDW 4.071288
## 7 EWR 4.031589
## 8 DCA 3.928532
## 9 DFW 3.915947
## 10 ORD 3.596630
Here, we can see that LAX (Los Angeles Airport) fares enjoy the most bang for the buck (i.e. miles per dollar) among the larger airports. ORD (Chicago Airport) is among the weakest in terms of miles per dollar. You could consider this if you were moving to either of these two cities and planning to travel quite a bit.
I now want to see what the fare value is across large fare carriers:
While this is a fairly busy plot, there is one airline that sticks out among the rest: National Airlines (N7). Let’s look at some of the data for the airline:
## Source: local data frame [12 x 14]
##
## Year quarter nsmiles airport_1 airport_2 passengers fare carrier_lg
## 1 2002 1 1515 LAS ORD 2074.440 142.7000 N7
## 2 2002 2 1515 LAS ORD 2441.640 139.4100 N7
## 3 2002 3 2066 IAD LAS 320.970 209.5000 N7
## 4 2002 3 1515 LAS ORD 2437.500 127.9900 N7
## 5 2001 2 2248 JFK LAS 2223.626 169.5230 N7
## 6 2001 2 1521 LAS MDW 1936.044 127.0379 N7
## 7 2001 3 1055 DFW LAS 1474.000 138.0000 N7
## 8 2001 3 2248 JFK LAS 2227.000 166.0000 N7
## 9 2001 3 1521 LAS MDW 1628.000 123.0000 N7
## 10 2001 3 2176 LAS MIA 684.000 158.0000 N7
## 11 2001 3 2177 LAS PHL 1297.000 156.0000 N7
## 12 2001 3 414 LAS SFO 2340.000 77.0000 N7
## Variables not shown: large_ms (dbl), fare_lg (dbl), carrier_low (fctr),
## lf_ms (dbl), fare_low (dbl), haul (fctr)
National Airlines gave the best mileage per dollar fares between 2001-2002, but it might explain why they are no longer around, too.
Let’s zoom in on the fare value among the most popular carriers:
## Source: local data frame [7 x 2]
##
## carrier_lg median_value
## 1 VX 7.224982
## 2 B6 6.742795
## 3 WN 5.069709
## 4 AA 4.471167
## 5 UA 4.426006
## 6 DL 3.969786
## 7 US 3.063922
Historically, it looks like Virgin America and JetBlue Airways are the two airlines that provide the best value for mileage per dollar. If you’re traveling long distances across country, it’s likely you’ll find good fares with those two airlines.
Let’s take a look at the fare value over the years (bivariate):
Over the years we have seen an increasing cost per mile traveled, but currently we’re experience a downward trend in cost per mile.
So, revisiting our linear model of log10(fare) ~ sqrt(nsmiles) + additional variables, let’s take a look at residuals:
## R-squared for model: 0.556
## [1] 0.1011344
## Don't know how to automatically pick scale for object of type AsIs. Defaulting to continuous
The individual residual data is plotted as points. The blue line is the standard deviation of all residuals, which is 0.1. The black line plot is the standard deviation at each reference mileage (or rather sqrt of mileage).
With an understanding of the strong relationship between distance and fare it made sense to me to look at other variables from that perspective. After seeing the relationship of fare value (cost per mile) to distance I thought to improve on the linear model that I had started in bivariate analysis.
The cost per mile graph was surprising in that the cost can go from $1 / mi at around a 250 mile distance to less than $0.40 / mi at around a 500 mile distance. The additional analysis into worst and best value airports and carriers was interesting to me.
Yes, I created a linear model using the log10 of fare and the square root of nsmiles. Other variables included are Year and airport pair (i.e. airport_1 and airport_2).
The variables of the linear model account for 55.6% of the variance in airline fares.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: position_dodge requires constant width: output may be incorrect
One interesting aspect of travel in general is the cost per mile for a transport mode. The cost per mile for air travel has fluctated over time and was the worst for a 14 year span (2001-2014) in 2008. Currently, we are experiencing a downward trend in cost per mile traveled as of 2014 Q4 with an inflection point in 2013. The blue line depicts the average cost per mile on a quarterly interval time series, which you can see fluctuates from quarter to quarter. The orange line is a smoothing of the same data to show trends more easily. The boxplot underneath shows the variance of cost per mile segmented yearly with the box representing the interquartile range (IQR) and the median by the horizontal bar within the plot. The median cost per mile was also the highest in 2008.
Mileage per dollar spent can be a metric for value. If you consider the mileage per dollar for airports, then you might have a sense of which airports offer good value for flying to or from, especially if you have a business where travel is frequent and you would like to minimize costs. If you consider the mileage per dollar for popular airlines, then you might have a sense of which airlines generally are going to have the best prices and ultimately, which frequent flyer programs or credit cards to join / acquire.
The plot of mileage per dollar across the ten largest airports (by activity) shows that Los Angeles International Airport (LAX) and Tampa International Airport (TPA) have the best median mileage per dollar spent. O’Hare International Airport (ORD) has the worst median mileage per dollar.
The plot of mileage per dollar across the most popular airlines shows that Virgin America and JetBlue Airways provide the best value in terms of how far you can travel per dollar. US Airways (now owned by American Airlines) is the worst in terms of mileage per dollar.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Another way of considering value with air travel is how much the cost per mile is for a given trip length. If you consider the value based on traveled miles, then you might travel farther to get the most travel for your money.
The plot of fare value (i.e. cost per mile) with one-way distance between airports roughly follows an inverse square root function. The cost per mile has high variance for travel up to 500 miles as you can see values can vary (roughly) between $1.25 (and beyond) to $0.38 after smoothing. For travel above 500 miles, the variance roughly sits between $0.38 and $0.12 after smoothing. It’s interesting to see that the distinction between short and medium haul trips by the air travel industry at 500 miles matches a “corner turn” in terms of fare value as well.
The DOT Domestic Airline Consumer Airfare Report has quarterly airfare data for 1996-2014. Unfortunately, only 2001-2014 had complete and usable data, so that is what was used by me to amass the 160,258 airport pair records. I started by simply cleaning and aggregating the data. Then, I explored single, double, and multiple variables in various ways to better understand the data. I was surprised that passenger activity did not have much weight on the predictive model of fare, but indirectly that may have been captured in the airport pair. The only strong positive correlation with fare was distance between airports, which was the basis of the linear model created. A secondary influence was Year, since there has been a general trend of increasing prices over time. The linear model improved once I realized that converting a non-linear relationship to a linear relationship (log10(fare) ~ sqrt(nsmiles) would better account for variance. Some limitations with this model are that it is averaged data instead of individual fares and that there are missing factors, such as how many days before the flight that the fare was purchased. If I had powerful computing at my disposal I would want to perform a similar analysis to this one on individual fare data.
Some interesting observations that I made while performing this analysis: